JUMP: replicability analysis of high-throughput experiments with applications to spatial transcriptomic studies

Abstract Motivation Replicability is the cornerstone of scientific research. The current statistical method for high-dimensional replicability analysis either cannot control the false discovery rate (FDR) or is too conservative. Results We propose a statistical method, JUMP, for the high-dimensional replicability analysis of two studies. The input is a high-dimensional paired sequence of p-values from two studies and the test statistic is the maximum of p-values of the pair. JUMP uses four states of the p-value pairs to indicate whether they are null or non-null. Conditional on the hidden states, JUMP computes the cumulative distribution function of the maximum of p-values for each state to conservatively approximate the probability of rejection under the composite null of replicability. JUMP estimates unknown parameters and uses a step-up procedure to control FDR. By incorporating different states of composite null, JUMP achieves a substantial power gain over existing methods while controlling the FDR. Analyzing two pairs of spatially resolved transcriptomic datasets, JUMP makes biological discoveries that otherwise cannot be obtained by using existing methods. Availability and implementation An R package JUMP implementing the JUMP method is available on CRAN (https://CRAN.R-project.org/package=JUMP).


B Comparison methods overview
In the simulation study, We compared JUMP to several statistical methods for replicability analysis methods (ad hoc BH, naïve MaxP, IDR, MaRR and radjust) and two p-value combination methods for meta-analysis. Let (p 1i , p 2i ), i = 1, . . . , m denote the paired p-values from two studies. We review these comparison methods as follows.

B.1 The Ad hoc BH method
BH [1] is the most popular multiple testing procedure that conservatively controls the FDR for m independent or positively correlated tests. In study j, j = 1, 2, the BH procedure proceeds as below: • Step 1. Let p j(1) ≤ p j(2) ≤ · · · ≤ p j(m) be the ordered p-values in study j, and denote by H (j) (i) the null hypothesis corresponding to p j(i) ; • Step 2. Find the largest i such that p j(i) ≤ i m α, i.e.,k = max{i ≥ 1 : p j(i) ≤ i m α}, andk = 0 if the set is empty; The ad hoc BH method for replicability analysis identifies features rejected by both studies as replicable signals.

B.2 The naïve MaxP method
Define the maximum p-values as As discussed in the paper, q i follows a super-uniform distribution under the replicability null. The naïve MaxP method directly applies BH [1] to q i , i = 1, . . . , m for FDR control of replicability analysis.

B.3 The radjust procedure
The radjust procedure [2] works as follows, • Step 1. For a pre-specified FDR level α, compute where S j is the set of features pre-selected in study j for j = 1, 2. By default, it selects features with p-values less than or equal to α/2.
• Step 2. Declare as replicated the features with indices in the set This procedure gains power by pre-filtering irrelevant features. It looks very similar to ad-hoc BH procedure where the BH procedure is implemented for each study and the intersection of significant findings are regarded as replicable features. After close inspection, we find these two procedures are quite different. We use the following toy example to illustrate the difference between ad hoc BH and radjust procedure.

B.4 The IDR procedure
The IDR procedure [5] deals with high throughput experimental data from two studies. For feature i, we have the bivariate observations (x 1i , x 2i ), i = 1, . . . , m. It is assumed that (x 1i , x 2i ), i = 1, . . . , m consist of genuine signals (replicable signals across two studies) and spurious signals (non-replicable signals). Let K i denote whether the ith feature is a replicable signal (K i = 1) or not (K i = 0). It is assumed that K i , i = 1, . . . , m are independent and follow the Bernoulli distribution. Denote π 1 = P (K i = 1). To induce dependence between x 1i and x 2i , we use a copula model. Specifically, we assume that the observed data (x 1i , x 2i ) are generated from latent variables (z 1i , z 2i ). The latent variables Denote the marginal distribution function of x ji , i = 1, . . . , m; j = 1, 2, as F j . Generate In this way, dependence across two studies is produced. To control the false discovery rate, we use the local irreproducible discovery rate (idr) as the test statistic, which is defined as the posterior probability of where The estimation of (π 1 , µ 1 , σ 2 1 , ρ 1 ) and (F 1 , F 2 ) is through the EM algorithm [3]. The step-up procedure based on ordered idr can be used for FDR control [11]. Specifically, let idr (1) ≤ · · · ≤ idr (m) be the ranked idr values, and denote H (1) , . . . , H (m) as the corresponding hypotheses. Find l = max{i : i −1 i j=1 idr j ≤ α}, and reject all H (i) with i = 1, . . . , l.

B.5 The MaRR procedure
The MaRR procedure [6] uses the maximum rank of each feature. The null hypothesis is that H 0i : p 1i and p 2i are irreproducible. Denote (R 1i , R 2i ) as the ranks of (p 1i , p 2i ), i = 1, . . . , m within each study. Define Let π 1 denote the proportion of replicable signals. Under the assumptions: (I1) if gene g is reproducible and gene h is irreproducible (I2) the correlation between the ranks of the reproducible gene is non-negative; (I3) the two ranks of the irreproducible gene are independent, irreproducible ranks R 1i and R 2i are uniformly distributed between ⌊mπ 1 ⌋ + 1 and m. Denote the conditional null survival function of M i /m as where i x = ⌊mx⌋ and j π1 = ⌊mπ 1 ⌋. The limiting conditional survival function of M i /m under the null is The empirical survival function can be estimated byŜ . By strong law of large numbers and Bayesian formula, If we estimate π 1 by i/m, we can define the mean square error (MSE) as follows.
k is chosen to minimize the MSE in the range between 0 and ⌊0.9m⌋.
Thusk/m serves as a good estimation of π 1 . To control the FDR at level α, the MaRR generates the rejection threshold as follows Reject all features associated with M i ≤N . Philtron et al. [6] relax assumption (I1) to (R1): P (R 1g < R 1h ) > 1/2 and P (R 2g < R 2h ) > 1/2, which is more plausible in practice.

B.6 TheŠidák's method
TheŠidák-corrected minimum p-value [7] can be used for meta-analysis. Specifically, we calculate the aggregated p-values across two studies through Assume that p 1i and p 2i , i = 1, . . . , m are independent. Under the null for meta-analysis where p 1i and p 2i follow standard uniform distribution, we compute the cdf of min{p 1i , p 2i }. Specifically, we have follows a standard uniform distribution under the meta-analysis null. Here we use the property that for a standard uniformly distributed random variable U, the cdf of F −1 (U ) is F.
We apply the BH procedure [1] on q S i , i = 1, . . . , m to evaluate the performance ofŠidák's method in replicability analysis.

B.7 The Lancaster's method
Lancaster's method [4] uses different weights for different studies. Denote F χ 2 w j as the cdf of a χ 2 distribution with w j , j = 1, 2 degree of freedom. For the ith hypothesis, Lancaster's method combines information across two studies by a test statistic , which follows a χ 2 distribution with degree of freedom w 1 + w 2 under the null for meta-analysis that both studies are from the null. The p-value for Lancaster's method is computed as the tail probability of the χ 2 distribution with w 1 + w 2 degrees of freedom evaluated at L i . We denote them as q L i , i = 1, . . . , m. We apply the BH procedure [1] on q L i , i = 1, . . . , m to evaluate the performance of Lancaster's method in replicability analysis.

C Realistic simulation studies
We performed realistic simulations based on Replicate 9 and Replicate 12 of the mouse olfactory bulb data measured with ST technology (files 'MOB Replicate 9' and 'MOB Replicate 12' in the Spatial Research Website at https://www.spatialresearch.org/resources-published-datasets/doi-10-1126science-aaf2403/) [8]. The two datasets include 15, 284 genes measured on 237 spatial spots and 16, 034 genes measured on 282 spots, respectively. We filtered out genes that are expressed in less than 10% spatial spots and selected spots with at least ten total read counts, resulting in 9, 547 genes on 236 spots for the Replicate 9 dataset and 9, 904 genes on 279 spots for the Replicate 12 dataset. The spatial expression patterns and parameters used in data generation for each study were inferred from SPARK [10]. We separately generated SRT count data based on the two studies following the simulation design in [10].
In study j (j = 1, 2), for each gene, the count on spot i was generated from where i = 1, . . . , 236 for study 1 and i = 1, . . . , 279 for study 2; N i denotes total counts of all genes on spot i, which is obtained from the mouse olfactory bulb data [8]; λ i represents the relative expression level of the focused gene, which will be generated; β i is the mean value of log λ i ; and ϵ i ∼ N (0, s 2 j ) is the random noise. If the gene in focus is not spatially variable, we set β i across all spatial spots to be constant, which is the median value of intercepts estimated from SPARK [10](−9.94 for study 1 and −9.93 for study 2). If the focused gene is an SVG, we used different β i for spots to exhibit spatial expression patterns. Specifically, we first categorized the spots into two groups based on the three spatial expression patterns in Fig. S1: a group of spots with low expression levels and a group of spots with high expression levels. In the low expression group, we set β i to be the median value of intercepts estimated by SPARK (−9.94 for study 1 and −9.93 for study 2); in the high expression group, we set β i to be two-fold (weak signal strength), three-fold (moderate signal strength) or four-fold (strong signal strength) of the corresponding median value on rate parameter, e.g., e βi = 2 · e a means β i is two-fold of a. Finally, y i was generated from (1) with simulated β i and ϵ i .
Let m = 10, 000, ξ 11 = 0.05 and ξ 01 = ξ 10 . For a given value of ξ 00 , corresponding ξ 01 and ξ 10 can be obtained by ξ 01 = ξ 10 = (1 − ξ 00 − ξ 11 )/2. States of genes in two SRT studies, θ 1i and θ 2i , were generated from a multinomial distribution with probabilities, P(θ 1i = k, θ 2i = l) = ξ kl , k, l ∈ {0, 1}, for pre-specified ξ 00 , ξ 01 , ξ 10 and ξ 11 . After obtaining θ ji for i = 1, . . . , m and j = 1, 2, we simulated gene count matrices based on corresponding ST data and parameters with different signal strengths (moderate or strong) and different standard deviations for the error term (s j = 0.3 or 0.5). Then we applied SPARK [10] on the two count data to get two paired p-values sequences, denoted as (p 1i , p 2i ), i = 1, . . . , m. Methods for replicability analysis are based on the paired p-value sequence. Fig. S2 and Fig. S3 show the FDR control and power comparison of different methods across different settings. We observe that MaxP and JUMP controlled the FDR at the nominal level across all settings, and JUMP is more powerful than MaxP. BH is not valid in practice since it failed to control the FDR in some settings (e.g., ξ 00 = 0.5). The power increased for all methods from Pattern I to Pattern III. By examining the three spatial expression patterns on which the corresponding data were generated ( Supplementary Fig. S1), we speculate that this might be due to the increased spatial variability from Pattern I to Pattern III.

D Computational time
We implemented all methods in R and evaluated the computational time of replicability analysis based on paired p-values. Computations were carried out in an Intel(R) Core(TM) i7-9750H 2.6Hz CPU with 64.0 GB RAM laptop. In the simulation studies, we set µ 1 = µ 2 = 2.5, σ 1 = σ 2 = 1, ξ 11 = 0.9, ξ 01 = ξ 10 = 0.025, and ξ 11 = 0.05. Let m = 10, 000, 20, 000, 50, 000, and 100, 000, respectively. Table S1 summarizes the computational time of different methods to finish one replication with different numbers of genes. We observe that the computation is fast for all methods except MaRR [6] and IDR [5]. JUMP is scalable to hundreds of thousands of genes. The minor extra computational time of JUMP over other valid methods for replicability analysis can be ignored given its substantial power gain in replicability analysis. Table S2 summarizes the data information and computational time for replicability analysis on two pairs of SRT datasets from mouse olfactory bulb and mouse cerebellum.   Figure S1: Spatial expression patterns summarized in two SRT studies on which the realistic simulation studies were performed. The SVGs were identified by SPARK [10] at an FDR level of   Figure S6: Spatial expression patterns of 24 genes randomly selected from the 169 replicable SVGs additionally identified by JUMP in mouse cerebellum. (a) Spatial expression patterns of the 24 randomly selected genes based on the Slide-seq data. (b) Spatial expression patterns of the 24 randomly selected genes based on the Slide-seqV2 data.